Abstract

We present an approach for the detection of coordinate-
term relationships between entities from the software
domain that refer to Java classes. Usually, relations
are found by examining corpus statistics associated with 
text entities. In some technical domains, however, 
we have access to additional information about the 
real-world objects named by the entities, suggesting 
that coupling information about the “grounded” entities 
with corpus statistics might lead to improved methods 
for relation discovery. To this end, we develop a 
similarity measure for Java classes using distributional 
information about how they are used in software, which 
we combine with corpus statistics on the distribution 
of contexts in which the classes appear in text. Using 
our approach, cross-validation accuracy on this dataset 
can be improved dramatically, from around 60% to 88%. 
Human labeling results show that our classifier has an
F1 score of 86% over the top 1000 predicted pairs.


1 Introduction 


Discovering semantic relations between text entities 
is a key task in natural language understanding. It 
is a critical component which enables the success of 
knowledge representation systems such as TextRunner 
[43], ReVerb [8], and NELL [4], which in turn are useful
for a variety of NLP applications, including temporal
scoping [38], semantic parsing [20], and entity linking
[25].

In this work, we examine coordinate relations between
words. According to the WordNet glossary, X and Y are
defined as coordinate terms if they share a common
hypernym [10, 27]. This is a symmetric relation that
indicates a semantic similarity, meaning that X and Y are
“a type of the same thing”, since they share at least one
common ancestor in some hypernym taxonomy (to paraphrase
the definition of Snow et al.).

Semantic similarity relations are normally discovered by
comparing corpus statistics associated with the entities:
for instance, two entities X and Y that usually appear in
similar contexts are likely to be semantically similar
[7, 32, 33]. However, in technical domains, we have access
to additional information about the real-world objects that
are named by the entities: e.g., we might have biographical
data about a person entity, or a 3D structural encoding of
a protein entity. In such situations, it seems plausible
that a “grounded” NLP method, in which corpus statistics
are coupled with data on the real-world referents of X and
Y, might lead to improved methods for relation discovery.

Here we explore the idea of grounded relation discovery
in the domain of software. In particular, we consider the
detection of coordinate-term relationships between
entities that (potentially) refer to Java classes. We use a
software domain text corpus derived from the Q&A website
StackOverflow (SO), in which users ask and answer questions
about software development, and we extract posts which have
been labeled by users as Java related. From this data, we
collected a small set of entity pairs that are labeled as
coordinate terms (or not) based on high-precision Hearst
patterns and frequency statistics, and we attempt to label
these pairs using information available from higher-recall
approaches based on distributional similarity.

We describe an entity linking method in order to 
map a given text entity to an underlying class type 
implementation from the Java standard libraries. Next, 
we describe corpus and code based information that 
we use for the relation discovery task. Corpus based 
methods include distributional similarity and string 
matching similarity. Additionally, we use two sources 
of code based information: (1) we define the class- 
context of a Java class in a given code repository, and are 
therefore able to calculate a code-based distributional 
similarity measure for classes, and (2) we consider 
the hierarchical organization of classes, described by 
the Java class type and namespace hierarchies. We 
demonstrate that using our approach, cross-validation
accuracy on this dataset is improved from 60.9% to 88%.
According to human labeling, our classifier has an F1
score of 86% over the highest-ranking 1000 predicted
pairs.

Figure 1: Visualization of predicted coordinate term pairs, where each pair of coordinate classes is connected
by an edge. Highly connected components are labeled by edge color, and it can be noted that they contain
classes with similar functionality. Some areas containing a functional class group have been magnified for easier
readability.

We see this work as a first step towards building a
knowledge representation system for the software domain,
in which text entities refer to elements from a software
code base, for example classes, methods, applications and
programming languages. Understanding software entity
relations will allow the construction of a domain specific
taxonomy and knowledge base, which can enable higher
reasoning capabilities in NLP applications for the software
domain [3, 29, 41, 42] and improve a variety of code
assisting applications, including code refactoring and
token completion.


Figure 1 shows a visualization based on coordinate
term pairs predicted using the proposed method. Java 
classes with similar functionality are highly connected 
in this graph, indicating that our method can be used 
to construct a code taxonomy. 


2 Related Work 

Semantic Relation Discovery. Previous work on semantic
relation discovery, in particular, coordinate term
discovery, has used two main approaches. The first is based
on the insight that certain lexical patterns indicate a
semantic relationship with high precision, as initially
observed by Hearst [16]. For example, the conjunction
pattern “X and Y” indicates that X and Y are coordinate
terms. Other pattern-based classifiers have [...] [13],
synonyms [24], and [...]. The second approach relies on
the notion that words that appear in a similar context are
likely to be semantically similar. In contrast to pattern
based classifiers, context distributional similarity
approaches are normally higher in recall [7, 32, 33, 36].
In this work we attempt to label samples extracted with
high-precision Hearst patterns, using information from
higher-recall methods.

Grounded Language Learning. The aim of grounded language
learning methods is to learn a mapping between natural
language (words and sentences) and the observed world
[14, 35, 44], where more recent work includes grounding
language to the physical world [19], and grounding of
entire discourses [28]. Early work in this field relied on
supervised aligned sentence-to-meaning data [12, 45].
However, in later work the supervision constraint has been
gradually relaxed [18, 23]. Relative to prior work on
grounded language acquisition, we use a very rich and
complex representation of entities and their relationships
(through software code). However, we consider a very
constrained language task, namely coordinate term
discovery.

Statistical Language Models for Software. In recent work
by NLP and software engineering researchers, statistical
language models have been adapted for modeling software
code. NLP models have been used to enhance a variety of
software development tasks such as code and comment token
completion [15, 17, 29, 34], analysis of code variable
names [2], and mining software repositories [11]. This has
been complemented by work from the programming language
research community for structured prediction of code
syntax trees. To the best of our knowledge, there is no
prior work on discovering semantic relations for software
entities.

3 Coordinate Term Discovery

In this section we describe a coordinate term classification
pipeline, as depicted at a high level in Figure 2. All the
following steps are described in detail in the sections
below.

Given a software domain text corpus (StackOverflow) and a
code repository (Java Standard Libraries), our goal is to
predict a coordinate relation for (X, Y), where X and Y are
nouns which potentially refer to Java classes.

We first attempt a baseline approach of labeling the pair
(X, Y) based on corpus distributional similarity. Since
closely related classes often exhibit morphological
closeness, we use as a second baseline the string
similarity of X and Y.

Next, we map noun X to an underlying class implementation
from the code repository, named X', according to an
estimated probability for p(Class X'|Word X), s.t.
$X' = \arg\max_C p(C|X)$ over all classes C. X' is then the
code referent of X. Similarly, we map Y to the class Y'.
Given a code-based grounding for X and Y we extract
information using the class implementations: (1) we define
a code based distributional similarity measure, using
code-context to encode the usage pattern of a class, and
(2) we use the hierarchical organization of classes,
described by the type and namespace hierarchies. Finally,
we combine all the above information in a single SVM
classifier.

Figure 2: Classification Pipeline for determining whether
nouns X and Y are coordinate terms. Each noun is mapped to
an underlying class from the code repository with
probability p(Class|Word). Textual features are extracted
based on the input words, code based features are extracted
using the mapped classes, and all of these are given to the
coordinate term classifier.

3.1 Baseline: Corpus Distributional Similarity. As an
initial baseline we calculate the corpus distributional
similarity of nouns (X, Y), following the assumption that
words with similar context are likely to be semantically
similar. Our implementation follows Pereira et al. [33]. We
calculate the empirical context distribution for noun X

(3.1) $p_X(c) = \frac{f(c, X)}{\sum_{c'} f(c', X)}$

where f(c, X) is the frequency of occurrence of noun X in
context c. We then measure the similarity of nouns X and Y
using the relative entropy or Kullback-Leibler divergence

(3.2) $D(p_X \| p_Y) = \sum_z p_X(z) \log \frac{p_X(z)}{p_Y(z)}$

As this measure is not symmetric we finally consider the
distributional similarity of X and Y as
$D(p_X \| p_Y) + D(p_Y \| p_X)$.
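The symmetric divergence of Section 3.1 can be illustrated with a minimal Python sketch. The function names, the toy context lists, and the additive smoothing constant `alpha` are assumptions made for illustration only; the text does not state how zero counts are handled, and some smoothing is needed here to keep the divergence finite. Lower values indicate more similar context distributions.

```python
import math
from collections import Counter


def context_distribution(contexts, vocabulary, alpha=1e-3):
    """Empirical context distribution (Equation 3.1) with additive smoothing.

    alpha is an assumed smoothing constant (not taken from the paper); it
    keeps every probability positive so the divergence below stays finite.
    """
    counts = Counter(contexts)
    total = sum(counts[c] for c in vocabulary) + alpha * len(vocabulary)
    return {c: (counts[c] + alpha) / total for c in vocabulary}


def kl_divergence(p, q):
    """Relative entropy D(p || q) = sum_z p(z) * log(p(z) / q(z))."""
    return sum(p[z] * math.log(p[z] / q[z]) for z in p if p[z] > 0)


def symmetric_divergence(contexts_x, contexts_y, alpha=1e-3):
    """D(p_X || p_Y) + D(p_Y || p_X) over the union of observed contexts."""
    vocabulary = sorted(set(contexts_x) | set(contexts_y))
    p_x = context_distribution(contexts_x, vocabulary, alpha)
    p_y = context_distribution(contexts_y, vocabulary, alpha)
    return kl_divergence(p_x, p_y) + kl_divergence(p_y, p_x)


# Hypothetical context observations for three nouns.
list_contexts = ["add", "remove", "iterate", "add"]
vector_contexts = ["add", "remove", "iterate"]
socket_contexts = ["connect", "close"]

close_pair = symmetric_divergence(list_contexts, vector_contexts)
far_pair = symmetric_divergence(list_contexts, socket_contexts)
print(close_pair < far_pair)
```

The same functions apply unchanged to any bag of contexts, so they could be reused for a code-based context distribution as well as a textual one.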

3.2 Baseline: String Similarity. Due to naming convention
standards, many related classes often exhibit some
morphological closeness. For example, classes that provide
Input/Output access to the file system will often contain
the suffix Stream or Buffer. Likewise, many classes extend
on the names of their super classes (e.g.,
JRadioButtonMenuItem extends the class JMenuItem). More
examples can be found in Figure 1 and Table 4. We therefore
include a second baseline which attempts to label the noun
pair (X, Y) as coordinate terms according to their string
matching similarity. We use the SecondString open source
Java toolkit¹. Each string is tokenized by camel case (such
that ArrayList is represented as Array List). We consider
the SoftTFIDF distance of the tokenized strings, as defined
by Cohen et al. [6].
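The camel-case tokenization step can be sketched in a few lines of Python. This is only an illustration: the regular expression and function names are assumptions, and the token-overlap (Jaccard) score below is a deliberately simple stand-in, not the SoftTFIDF distance from the SecondString toolkit.

```python
import re

# Splits on camel-case boundaries while keeping acronym runs together,
# e.g. GZIPOutputStream -> GZIP, Output, Stream.
_CAMEL = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+")


def camel_case_tokens(name):
    """Tokenize a class label by camel case (ArrayList -> Array, List)."""
    return _CAMEL.findall(name)


def token_overlap(name_a, name_b):
    """Jaccard overlap of lower-cased tokens (a stand-in for SoftTFIDF)."""
    tokens_a = {t.lower() for t in camel_case_tokens(name_a)}
    tokens_b = {t.lower() for t in camel_case_tokens(name_b)}
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


print(camel_case_tokens("ArrayList"))
print(token_overlap("JRadioButtonMenuItem", "JMenuItem"))
```

Unlike this stand-in, SoftTFIDF also weights tokens by corpus frequency and tolerates near-matching tokens, so the scores here should be read as a rough proxy only.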


ARG-Method: Class is being passed as an argument to Method.
We count an occurrence of this context once for the method
definition
    Method(Class class, ...)
as well as for each method invocation
    Method(class, ...)
For example, given the statement
    str = toString(i);
where i is an Integer, we would count an occurrence for
this class in the context ARG-toString.

API-Method: Class provides the API method Method. We count
an occurrence of this context once for the method
definition, and for every occurrence of the method
invocation, e.g. class.Method(...).
For example, given the statement
    s = map.size();
where map is a HashMap, we would count an occurrence for
this class in the context API-size.

Table 1: Definition of two types of code-contexts for a
class type, Class, or an instantiation of that type (e.g.,
class).

3.3 Entity Linking. In order to draw code based information
on text entities, we define a mapping function between
words and class types. Our goal is to find p(C|W), where C
is a specific class implementation and W is a word. This
mapping is ambiguous, for example, since users are less
likely to mention the qualified class name (e.g.,
java.lang.String), and usually use the class label, meaning
the name of the class not including its package (e.g.,
String). As an example, the terms java.lang.String and
java.util.Vector appear 37 and 1 times respectively in our
corpus, versus the terms String and Vector which appear 35K
and 1.6K times. Additionally, class names appear with
several variations, including case-insensitive versions,
spelling mistakes, or informal names (e.g., array instead
of ArrayList).

Therefore, in order to approximate p(C, W) in

(3.3) $p(C|W) = \frac{p(C, W)}{p(W)}$

we estimate a word to class-type mapping that is mediated
through the class label, L, as

(3.4) $p(C, W) = p(C, L) \cdot p(L, W)$

Since $p(C, L) = p(C|L)\, p(L)$, this can be estimated by
the corresponding MLEs

(3.5) $\hat{p}(C, L) = \hat{p}(C|L) \cdot \hat{p}(L) = \frac{f(C)}{\sum_{C' \in L} f(C')} \cdot \frac{f(L)}{\sum_{L'} f(L')}$

where f() is the frequency function. Note that since
$\sum_{C' \in L} f(C') = f(L)$ we get that
$\hat{p}(C, L) = \hat{p}(C)$, as the class label is
uniquely determined by the class qualified name (the
opposite does not hold since multiple class types may
correspond to the same label). Finally, the term p(L, W) is
estimated by the symmetric string distance between the two
strings, as described in Section 3.2. We consider the
linking probability of (X, Y) to be
$p(X'|X) \cdot p(Y'|Y)$, where X' is the best matching
class for X s.t. $X' = \arg\max_C p(C|X)$ and similarly for
Y'.
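The linking step of Section 3.3 can be sketched as follows, under stated assumptions. The class frequencies are invented toy counts, `difflib.SequenceMatcher` stands in for the string measure of Section 3.2, and dividing by the sum over candidate classes plays the role of p(W). None of these choices is taken from the paper; the sketch only shows how a frequency prior and a label-to-word string score combine into p(C|W).

```python
from difflib import SequenceMatcher


def class_label(qualified_name):
    """Class label: the class name without its package."""
    return qualified_name.rsplit(".", 1)[-1]


def link_word(word, class_frequencies):
    """Return (best class, estimated p(C|W)) for a word.

    class_frequencies maps qualified class names to hypothetical usage
    counts f(C). The score of each class is p(C) * sim(label, word).
    """
    total = sum(class_frequencies.values())
    scores = {}
    for qualified_name, frequency in class_frequencies.items():
        prior = frequency / total
        similarity = SequenceMatcher(
            None, class_label(qualified_name).lower(), word.lower()
        ).ratio()
        scores[qualified_name] = prior * similarity
    normalizer = sum(scores.values())  # stands in for p(W)
    if normalizer == 0:
        return None, 0.0
    best = max(scores, key=scores.get)
    return best, scores[best] / normalizer


# Hypothetical counts, for illustration only.
toy_frequencies = {
    "java.lang.String": 50,
    "java.util.Vector": 5,
    "java.util.ArrayList": 20,
}
best_class, probability = link_word("string", toy_frequencies)
print(best_class)
```

Because the prior multiplies the string score, a very frequent class can outrank a rarer class whose label matches the word better; a real system would need to balance the two terms carefully.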


3.4 Code Distributional Similarity. Corpus distributional
similarity evaluates the occurrence of words in particular
semantic contexts. By defining the class-context of a Java
class, we can then similarly calculate a code
distributional similarity between classes. Our definition
of class context is based on the usage of a class as an
argument to methods and on the API which the class
provides, and it is detailed in Table 1. We observe over
23K unique contexts in our code repository. Based on these
definitions we can compute the distributional similarity
measure between classes X' and Y' based on their
code-context distributions, as previously described for the
corpus distributional similarity (Section 3.1, following
Pereira et al. [33]). For the code-based case, we calculate
the empirical context distribution of X' (see Equation 3.1)
using f(c, X'), the occurrence frequency of class X' in
context c, where c is one of the ARG-Method or API-Method
contexts (defined in Table 1) for methods observed in the
code repository. The distributional similarity of (X', Y')
is then taken, using the relative entropy, as
$D(p_{X'} \| p_{Y'}) + D(p_{Y'} \| p_{X'})$.
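As an illustration of how code-context counts might be accumulated, the sketch below assumes that an earlier pass over the syntax trees (not shown) has already emitted one record per observation, in the hypothetical form `(kind, method, class_name)`. The record format and function names are assumptions; only the ARG-Method and API-Method context names follow Table 1.

```python
from collections import Counter, defaultdict

_PREFIX = {"arg": "ARG", "api": "API"}


def collect_code_contexts(observations):
    """Count ARG-Method / API-Method contexts per class.

    observations is an iterable of (kind, method, class_name) tuples,
    where kind is "arg" (class passed as an argument to method) or
    "api" (class provides method).
    """
    contexts = defaultdict(Counter)
    for kind, method, class_name in observations:
        contexts[class_name][f"{_PREFIX[kind]}-{method}"] += 1
    return contexts


def empirical_distribution(counter):
    """Normalize raw context counts into a probability distribution."""
    total = sum(counter.values())
    return {context: count / total for context, count in counter.items()}


# Records for the two statements used as examples in Table 1, plus one
# extra hypothetical invocation of HashMap.size().
toy_observations = [
    ("arg", "toString", "Integer"),  # str = toString(i);
    ("api", "size", "HashMap"),      # s = map.size();
    ("api", "size", "HashMap"),
    ("api", "put", "HashMap"),
]
code_contexts = collect_code_contexts(toy_observations)
print(empirical_distribution(code_contexts["HashMap"]))
```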


1 http://secondstring.sourceforge.net/











3.5 Code Hierarchies and Organization. The words X and Y
are defined as coordinate terms if they have the same
hypernym in a given taxonomy, meaning they have at least
one common ancestor in this taxonomy. For the purpose of
comparing two class types, we therefore define an ancestry
relation between them using two taxonomies based on the
code namespace and type hierarchies.

Package Taxonomy: A package is the standard way for
defining namespaces in the Java language. It is a mechanism
for organizing sets of classes which normally share a
common functionality. Packages are organized in a
hierarchical structure which can be easily inferred from
the class name. For example, the class java.lang.String
belongs to the java.lang package, which belongs to the java
package.

Type Taxonomy: The inheritance structure of classes and
interfaces in the Java language defines a type hierarchy,
such that class A is the ancestor of class B if B extends
or implements A.

We define type-ancestry and package-ancestry relations
between classes (X', Y') based on the above taxonomies. For
the type taxonomy,

$A^n_{type}(X', Y')$ = {# of common ancestors X' and Y'
share within n higher up levels in the type taxonomy}

for n from 1 to 6. $A^n_{package}$ is defined similarly for
the package taxonomy. As an example,

$A^2_{package}(ArrayList, Vector) = 2$

as these classes both belong in the package java.util, and
therefore their common level 2 ancestors are: java and
java.util. Moreover,

$A^1_{type}(ArrayList, Vector) = 5$

since both classes extend the AbstractList class, and also
implement four joint interfaces: List, RandomAccess,
Cloneable, and Serializable.

4 Experimental Settings

4.1 Data Handling. We downloaded a dump of the interactions
on the StackOverflow website² from its launch date in 2008
and until 2012. We use only the 277K questions labeled with
the user-assigned Java tag, and their 629K answers.

Text from the SO html posts was extracted with the Apache
Tika toolkit³ and then tokenized with the Mallet
statistical NLP package [26]. In this study, we use only
the text portions of the SO posts, and exclude all raw code
segments, as indicated by the user-labeled <code> markup.
Next, the text was POS tagged with the Stanford POS tagger
[39] and parsed with the MaltParser [30]. Finally, we
extract noun pairs with the conjunction dependencies: conj
or inv-conj, a total of 255,150 pairs, which we use as
positive training samples.

We use the Java standard libraries code repository as a
grounding source for Java classes, as we expect that users
will often refer to these classes in the Java tagged SO
posts. This data includes: 7072 source code files, the
implementation of 10562 class and interface types, and 477
packages. The code repository is parsed using the Eclipse
JDT compiler tools, which provide APIs for accessing and
manipulating Abstract Syntax Trees.

2 http://www.clearbits.net/creators/146-stack-exchange-data-dump

3 http://tika.apache.org/

High PMI                                             Low PMI
(JTextField, JComboBox)                              (threads, characters)
(yearsPlayed, totalEarned)                           (server, user)
(PostInsertEventListener, PostUpdateEventListener)   (code, design)
(removeListener, addListener)                        (Java, client)
(MinTreeMap, MaxTreeMap)                             (Eclipse, array)

Table 2: Sample set of word pairs with high and low PMI
scores. Many of the high PMI pairs refer to software
entities such as variable, method and Java class names,
whereas the low PMI pairs contain more general software
terms.

4.2 Classification. We follow the classification pipeline
described in Figure 2, using the LibLinear SVM classifier
[5, 9] with the following features:

Corpus-Based Features


• Corpus distributional similarity (Corpus Dist. Sim.) -
  see Section 3.1

• String similarity (String Sim.) - see Section 3.2

Code-Based Features

• Text to code linking probability (Text-to-code Prob.) -
  see Section 3.3

• Code distributional similarity (Code Dist. Sim.) - see
  Section 3.4

• Package and type ancestry ($A^1_{package}$ to
  $A^6_{package}$ and $A^1_{type}$ to $A^6_{type}$) - see
  Section 3.5

Since the validity of the code based features above is
directly related to the success of the entity linking
phase, each of the code based features is used in the
classifier once with the original value and a second time
with the value weighted by the text to code linking
probability.

Method               Coord          Coord-PMI
Code & Corpus        85.3           88
Baselines:
Corpus Dist. Sim.    57.8           58.2
String Sim.          65.2           65.8
Corpus Only          64.7           60.9
Code Only            80.1           81.1
Code Features:
Code Dist. Sim.      67 (60.2)      67.2 (59)
A^1_package          64.2 (63.8)    64.3 (63.9)
A^2_package          64.2 (63.8)    61.2 (64.8)
A^3_package          65.8 (64.3)    66 (64.6)
A^4_package          52.5 (52)      64.7 (58.7)
A^5_package          52.5 (52)      52.6 (58.7)
A^6_package          50.4 (51.6)    52.3 (52)
A^1_type             51.4 (51.4)    55.1 (53.7)
A^2_type             54 (53.9)      55.5 (54.3)
A^3_type             56.8 (56.7)    57 (56.9)
A^4_type             57.1 (56.9)    57.3 (57.1)
A^5_type             57.4 (57.6)    58 (57.9)
A^6_type             57.2 (57.4)    57.5 (57.5)
Text-to-code Prob.   55.7           55.8

Table 3: Cross validation accuracy results for the
coordinate term SVM classifier (Code & Corpus), as well as
baselines using corpus distributional similarity, string
similarity, all corpus based features (Corpus Only), or all
code based features (Code Only), and all individual code
based features. The weighted version of the code based
features (see Section 4.2) is in parentheses. Results are
shown for both the Coord and Coord-PMI datasets.
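The package and type ancestry features of Section 3.5, together with the weighting by linking probability described in this section, can be sketched as below. The helper names are assumptions, the supertype map is a small hand-written excerpt rather than a parsed code repository, and "within n higher up levels" is interpreted as "reachable in at most n upward steps from each class", which is one plausible reading of the definition.

```python
def package_ancestors(qualified_name, n):
    """Packages enclosing a class, nearest first, limited to n levels up."""
    parts = qualified_name.split(".")[:-1]
    prefixes = [".".join(parts[:i]) for i in range(len(parts), 0, -1)]
    return set(prefixes[:n])


def type_ancestors(class_name, supertypes, n):
    """Supertypes reachable within n extends/implements steps."""
    seen, frontier = set(), {class_name}
    for _ in range(n):
        frontier = {s for c in frontier for s in supertypes.get(c, ())} - seen
        seen |= frontier
    return seen


def common_count(ancestors_a, ancestors_b):
    """Ancestry feature value: number of shared ancestors."""
    return len(ancestors_a & ancestors_b)


def with_weighted_copy(value, link_probability):
    """Return the raw feature and its copy weighted by linking probability."""
    return value, value * link_probability


# Hand-written excerpt of the type hierarchy, for illustration only.
TOY_SUPERTYPES = {
    "ArrayList": ["AbstractList", "List", "RandomAccess", "Cloneable", "Serializable"],
    "Vector": ["AbstractList", "List", "RandomAccess", "Cloneable", "Serializable"],
    "AbstractList": ["AbstractCollection", "List"],
}

package_feature = common_count(
    package_ancestors("java.util.ArrayList", 2),
    package_ancestors("java.util.Vector", 2),
)
type_feature = common_count(
    type_ancestors("ArrayList", TOY_SUPERTYPES, 1),
    type_ancestors("Vector", TOY_SUPERTYPES, 1),
)
print(package_feature, type_feature)
```

On this toy input the two counts agree with the ArrayList and Vector examples given in Section 3.5.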

Of the noun pairs (X, Y) in our data, we keep only pairs
for which the linking probability $p(X'|X) \cdot p(Y'|Y)$
is greater than 0.1. Note that this guarantees that each
noun must be mapped to at least one class with non-zero
probability. Next, we evaluate the string morphology and
its resemblance to a camel-case format, which is the
acceptable formatting for Java class names. We therefore
select alphanumeric terms with at least two upper-case and
one lower-case characters. We name this set of noun pairs
the Coord dataset.
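The two filters that define the Coord dataset can be written down directly. In this sketch the function names are hypothetical, and applying the camel-case check to both terms of a pair is an assumption; the 0.1 threshold and the character-count rule come from the text.

```python
def looks_like_class_name(term):
    """Alphanumeric, at least two upper-case and one lower-case character."""
    upper = sum(ch.isupper() for ch in term)
    lower = sum(ch.islower() for ch in term)
    return term.isalnum() and upper >= 2 and lower >= 1


def keep_pair(x, y, link_prob_x, link_prob_y, threshold=0.1):
    """Keep a noun pair if it links confidently and both terms look like class names."""
    linked = link_prob_x * link_prob_y > threshold
    return linked and looks_like_class_name(x) and looks_like_class_name(y)


print(keep_pair("ArrayList", "JMenu", 0.5, 0.5))
```

Note that a single-capital label such as String fails the morphology rule, so pairs involving such terms would be excluded by this filter.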

A key assumption underlying statistical distributional
similarity approaches is that “high-interest” entities are
associated with higher corpus frequencies, and that
therefore, given sufficient statistical evidence,
“high-interest” relations can be extracted. In the software
domain, real world factors may introduce biases in a
software-focused text corpus which may affect the corpus
frequencies of classes: e.g., users may discuss classes
based on the clarity of their API, the efficiency of their
implementation, or simply because they are fundamental in
software introduced to novice users. Another motivation for
using grounded data, such as the class implementation, is
that it may highlight additional aspects of interest, for
example, classes that are commonly inherited from. We
therefore define a second noun dataset, Coord-PMI, which
attempts to address this issue, in which noun pairs are
selected based on their pointwise mutual information (PMI):

(4.6) $PMI(X, Y) = \log \frac{p(X, Y)}{p(X)\, p(Y)}$

where the frequency of the pair (X, Y) in the corpus is
positive. In this set we include coordinate term pairs with
high PMI scores, which appear more rarely in the corpus and
are therefore harder to predict using standard NLP
techniques. The negative set in this data are noun pairs
which appear frequently separately but do not appear as
coordinate terms, and are therefore marked by low PMI
scores.
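For reference, pointwise mutual information in its standard form, log of p(X, Y) over p(X) p(Y), can be computed from raw counts as in the sketch below. The count arguments, the toy numbers, and the use of the natural logarithm are assumptions for illustration; the paper does not specify the logarithm base or how the probabilities were estimated.

```python
import math


def pmi(pair_count, x_count, y_count, total_pairs, total_words):
    """Pointwise mutual information estimated from corpus counts.

    Requires a positive pair count, matching the condition that the
    pair (X, Y) has been observed in the corpus.
    """
    if pair_count <= 0:
        raise ValueError("PMI is only defined here for observed pairs")
    p_xy = pair_count / total_pairs
    p_x = x_count / total_words
    p_y = y_count / total_words
    return math.log(p_xy / (p_x * p_y))


# Hypothetical counts: two rare terms that almost always co-occur,
# versus two frequent terms that rarely appear together.
rare_pair = pmi(pair_count=8, x_count=10, y_count=10,
                total_pairs=1000, total_words=10000)
frequent_pair = pmi(pair_count=8, x_count=2000, y_count=2000,
                    total_pairs=1000, total_words=10000)
print(rare_pair > frequent_pair)
```

The comparison shows why high-PMI pairs tend to be rare but specific terms, while frequent general terms that seldom co-occur receive low scores.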

To illustrate this point, we provide a sample of noun pairs
with low and high PMI scores in Table 2, where pairs
highlighted with bold font are labeled as coordinate terms
in our data. We can see that the high PMI set contains
pairs that are specific and interesting in the software
domain while not necessarily being frequent words in the
general domain. For example, some pairs seem to represent
variable names (e.g., (yearsPlayed, totalEarned)), others
likely refer to method names (e.g., (removeListener,
addListener)). Some pairs refer to Java classes, such as
(JTextField, JComboBox) whose implementation can be found
in the Java code repository. We can also see examples of
pairs such as (PostInsertEventListener,
PostUpdateEventListener) which are likely to be
user-defined classes with a relationship to the Java class
java.util.EventListener. In contrast, the low PMI set
contains more general software terms (e.g., code, design,
server, threads).

5 Results 

5.1 Classification and Feature Analysis. In Table 3 we
report the cross validation accuracy of the coordinate term
classifier (Code & Corpus) as well as baseline classifiers
using corpus distributional similarity (Corpus Dist. Sim.),
string similarity (String Sim.), all corpus features
(Corpus Only), or all code features (Code Only). Note that
using all code features is significantly more successful on
this data than any of the corpus baselines (corpus
baselines' accuracy is between 57%-65% whereas code-based
accuracy is over 80%). When using both data sources,
performance is improved even further (to over 85% on the
Coord dataset and 88% on Coord-PMI).

We provide an additional feature analysis in Table 3, and
report the cross validation accuracy of classifiers using
each single code feature. Interestingly, code
distributional similarity (Code Dist. Sim.) is the
strongest single feature, and it is a significantly better
predictor than corpus distributional similarity, achieving
around 67% vs. around 58% for both datasets.

Code Dist. Sim.                           A^3_package                                     A^5_type
(FileOutputStream,OutputStream)           (KeyEvent,KeyListener)                          (JMenuItem,JMenu)
(AffineTransform,AffineTransformOp)       (StyleConstants,SimpleAttributeSet)             (JMenuItems,JMenu)
(GZIPOutputStream,DeflaterOutputStream)   (BlockQueue,ThreadPoolExecutor)                 (JMenuItems,JMenus)
(OutputStream,DataOutputStream)           (BufferedImage,WritableRaster)                  (JLabel,DefaultTreeCellRenderer)
(AtomicInteger,AtomicIntegerArray)        (MouseListener,MouseWheelListener)              (JToggleButton,JRadioButtonMenu)
(ResourceBundle,ListResourceBundle)       (DocumentBuilderFactory,DocumentBuilder)        (JFrame,JDialogs)
(setIconImages,setIconImage)              (ActionListeners,FocusListeners)                (JTable,JTableHeader)
(ComboBoxModel,DefaultComboBoxModel)      (DataInputStream,DataOutputStream)              (JTextArea,JEditorPane)
(JTextArea,TextArea)                      (greaterOrEqualThan,lesserOrEqualThan)          (JTextPane,JEditorPane)
(ServerSocketChannel,SocketChannel)       (CopyOnWriteArrayList,ConcurrentLinkedQueue)    (JTextArea,JTable)

Table 4: Top ten coordinate terms predicted by classifiers using one of the following features: code distributional
similarity, package hierarchy ancestry ($A^3_{package}$), and type hierarchy ancestry ($A^5_{type}$). All of the displayed
predictions are true.

Figure 3: Manual Labeling Results. F1 results of the top
1000 predicted coordinate terms by rank. The final data
point in each line is labeled with the F1 score at rank
1000.


5.2 Evaluation by Manual Labeling. The cross-validation
results above are based on labels extracted using Hearst
conjunction patterns. In Figure 3 we provide an additional
analysis based on manual human labeling of samples from the
Coord-PMI dataset, following a procedure similar to prior
researchers exploring semi-supervised methods for relation
discovery [4]. After all development was complete, we hand
labeled the top 1000 coordinate term pairs according to the
ranking by our full classifier (using all code and corpus
features) and the top 1000 pairs predicted by the
classifiers based on code and corpus distributional
similarities only. We report the F1 results of each
classifier by the rank of the predicted samples. According
to our analysis, the F1 score for the text and code
distributional similarity classifiers degrades quickly
after the first 100 and 200 top ranked pairs, respectively.
At rank 1000, the score of the full classifier is at 86%,
whereas the code and text classifiers are only at 56% and
28%.

To highlight the strength of each of the code based
features, we provide in Table 4 the top ten coordinate
terms predicted using the most successful code based
features. For example, the top prediction using type
hierarchy ancestry ($A^5_{type}$) is (JMenuItem, JMenu).
Since JMenu extends JMenuItem, the two classes indeed share
many common interfaces and classes. Alternatively, all of
the top predictions using the package hierarchy ancestry
($A^3_{package}$) are labels that have been matched to
pairs of classes that share at least 3 higher up package
levels. So for example, BlockQueue has been matched to
java.util.concurrent.BlockingQueue which was predicted as a
coordinate term of ThreadPoolExecutor which belongs in the
same package. Using code distributional similarity, one of
the top predictions is the pair (GZIPOutputStream,
DeflaterOutputStream), which share many common API methods
such as write, flush, and close. Many of the other top
predicted pairs by this feature have been mapped to the
same class and therefore have the exact same context
distribution.

5.3 Taxonomy Construction. We visualize the coordinate term
pairs predicted using our method (with all features), by
aggregating them into a graph where entities are nodes and
edges are determined by a coordinate term relation
(Figure 1). Graph edges are colored using the Louvain
method for community detection and an entity label's size
is determined by its betweenness centrality degree. We can
see that high-level communities in this graph correspond to
class functionality, indicating that our method can be used
to create an interesting code taxonomy.

Note that our predictions also highlight connections within
functional groups that cannot be found using the package or
type taxonomies directly. One example can be highlighted
within the GUI functionality group. Listener classes
facilitate a response mechanism to GUI Actions, such as
pressing a button, or entering text; however, these classes
belong in different packages than basic GUI components for
historical reasons. In our graph, Action and Listener
classes belong to the same communities of the GUI
components they are normally used with.

6 Conclusions 

We have presented an approach for grounded discovery of
coordinate term relationships between text entities
representing Java classes. Using a simple entity linking
method we map text entities to an underlying class type
implementation from the Java standard libraries. With this
code-based grounding, we extract information on the usage
pattern of the class and its location in the Java class and
namespace hierarchies. Our experimental evaluation shows
that using only corpus distributional similarity for the
coordinate term prediction task is unsuccessful, achieving
prediction accuracy of around 58%. However, adding
information based on the entities' software implementation
improves accuracy dramatically to 88%. Our classifier has
an F1 score of 86% according to human labeling over the top
1000 predicted pairs. We have shown that our predictions
can be used to build an interesting code taxonomy which
draws from the functional connections, common usage
patterns, and implementation details that are shared
between classes.